Evaluating impact of receptive field in Encoder-Decoder and U-Net models for Lane Detection Segmentation
Self-driving cars rely heavily on computer vision to make sense of the road. One of the most important tasks is detecting lanes so the car knows where it can drive safely.
Deep learning models, especially CNNs (Convolutional Neural Networks), are great at recognizing patterns in images. More specifically, they focus on particular parts of the image when making a decision.
That’s where something called the receptive field comes in. If the receptive field is too small or not well-designed, the model might miss important context, like curves or broken lane lines.
In this project, we explore how the size and shape of the receptive field impacts the model’s ability to accurately detect lanes using simulated driving scenes from the Carla simulator.
For this project we use a dataset based on the CARLA (Car Learning to Act) driving simulator, an open-source simulator designed for autonomous driving research. CARLA generates driving scenarios suitable for training models on tasks like lane detection, object detection, and semantic segmentation.
The dataset consists of 6,408 PNG files in total: 3,075 training images, 3,075 training labels, 129 validation images, and 129 validation labels.
The receptive field of a neuron (or unit) in a CNN is the region of the input image that affects the output of that neuron.
If a neuron in a deep layer becomes active, the receptive field tells you which area of the original input image caused it to react.
Captures Context: Larger RFs help the network understand spatial relationships, objects, and background context.
Design Consideration: Helps decide kernel sizes, number of layers, and strides when building CNN architectures.
Segmentation Tasks: Accurate pixel-wise predictions require balancing fine details (small RF) and global context (large RF).
Deeper Layers = Larger RF: But also need to manage loss of spatial precision.
| Method | Explanation |
|---|---|
| Gradient-Based | Computes gradients of the output w.r.t. input pixels to see which parts affect the activation. |
| Backpropagation | Tracks how changes in pixel values affect activations through the layers. |
| Occlusion Mapping | Slides a gray patch over parts of the input image and observes how the output changes. |
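Occlusion mapping is easy to sketch without a real network. In this illustrative example (not code from the project), `score` is a stand-in for a model activation, and the "lane" is a fake vertical stripe:

```python
import numpy as np

# Sketch of occlusion mapping: slide a gray patch over the input and record how
# much a scoring function drops. `score` is a hypothetical stand-in for a model.
def occlusion_map(image, score, patch=4, gray=0.5):
    base = score(image)
    h, w = image.shape
    heat = np.zeros((h - patch + 1, w - patch + 1))
    for i in range(heat.shape[0]):
        for j in range(heat.shape[1]):
            occluded = image.copy()
            occluded[i:i+patch, j:j+patch] = gray  # gray out one patch
            heat[i, j] = base - score(occluded)    # big drop = important region
    return heat

image = np.zeros((8, 8))
image[:, 3:5] = 1.0  # fake vertical "lane" stripe in columns 3-4
heat = occlusion_map(image, score=lambda im: im[:, 3:5].sum())
print(heat.max())  # 4.0: largest drop when the patch fully covers the stripe
```

The heatmap is hottest where occlusion hurts the score most, i.e. over the stripe.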
As you go deeper into the CNN, the receptive field increases.
Early layers capture edges and textures (small RFs).
Deeper layers capture shapes, parts, and full objects (large RFs).
Explanation: Kernel size controls the area that each filter covers. A larger kernel size increases the receptive field, allowing the model to capture more context in the image.
Example: Using a 5x5 kernel instead of a 3x3 kernel allows the model to capture broader information, improving its ability to detect lane markings over longer distances.
Explanation: Stride determines how much the filter moves over the input image. Increasing the stride reduces the spatial resolution of the feature maps but increases the effective receptive field.
Example: By using a stride of 2, instead of 1, we can increase the receptive field while reducing the computation required, which is helpful for detecting lanes over long stretches of road.
Explanation: Dilation expands the filter by inserting gaps between its elements, which increases the receptive field without increasing the number of parameters. This allows the network to capture larger contextual information.
Example: A dilated 3x3 convolution can help capture more context compared to a standard 3x3 filter, enabling the model to detect lane markings that span across larger sections of the image.
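The effect of kernel size, stride, and dilation on the receptive field can be checked with the standard recurrence (RF grows by `(k - 1) * dilation * jump` per layer, and stride multiplies the jump). The layer configurations below are illustrative, not the exact architectures from this project:

```python
# Sketch: theoretical receptive field of a stack of conv layers.
def receptive_field(layers):
    """Each layer is (kernel_size, stride, dilation).
    Returns the receptive field of the final layer, in input pixels."""
    rf = 1    # receptive field of the input itself
    jump = 1  # distance between adjacent units, measured in input pixels
    for k, s, d in layers:
        rf += (k - 1) * d * jump
        jump *= s
    return rf

print(receptive_field([(3, 1, 1)] * 2))         # two 3x3 convs -> RF 5
print(receptive_field([(5, 1, 1)]))             # one 5x5 conv  -> RF 5
print(receptive_field([(3, 2, 1), (3, 1, 1)]))  # stride 2 doubles later growth -> RF 7
print(receptive_field([(3, 1, 2)]))             # dilated 3x3   -> RF 5, same params as 3x3
```

Note how a dilated 3x3 matches the RF of a 5x5 with far fewer parameters, and how an early stride of 2 amplifies every later layer's contribution.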
- Classification: assign a single label to the whole image (e.g., cat or dog)
- Semantic segmentation: assign a label to each pixel; the output is a mask
Convolution:
- Matrix dot product: sum of element-wise multiplication between kernel and image patch
- Downsampling: reduces spatial resolution

Transposed convolution:
- Opposite of convolution: upsamples and expands the feature map
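Convolution as a sum of element-wise multiplications can be sketched in a few lines of numpy (a naive loop for clarity, not the implementation used in the project):

```python
import numpy as np

# Sketch: "valid" 2-D convolution (cross-correlation, as CNNs actually compute it).
# Each output value is the sum of an element-wise product between the kernel
# and one image patch; stride > 1 downsamples the output.
def conv2d(image, kernel, stride=1):
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((3, 3))
print(conv2d(image, kernel))            # 2x2 output: each entry sums a 3x3 patch
print(conv2d(image, kernel, stride=2))  # stride 2 shrinks the output to 1x1
```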
Encoder-Decoder:
- Convolution layers in the encoder
- Transposed convolution layers in the decoder

U-Net:
- Double convolution layers
- Transposed convolution layers in the expanding path
- Skip connections (concatenation) for spatial context
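A minimal PyTorch sketch shows how these pieces fit together: a double-conv encoder block, a transposed-conv expanding path, and a skip connection that concatenates encoder features into the decoder. This is an illustrative toy, not the exact architecture trained in this project:

```python
import torch
import torch.nn as nn

# Toy U-Net-style model: one encoder level, one bottleneck, one decoder level.
class TinyUNet(nn.Module):
    def __init__(self, in_ch=3, n_classes=3):
        super().__init__()
        self.enc = nn.Sequential(  # double convolution block
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.mid = nn.Sequential(
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)  # expanding path
        self.dec = nn.Sequential(  # skip concat doubles the input channels
            nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, n_classes, 1))

    def forward(self, x):
        e = self.enc(x)
        m = self.mid(self.down(e))
        u = self.up(m)
        return self.dec(torch.cat([u, e], dim=1))  # skip connection (concat)

x = torch.randn(1, 3, 64, 64)
print(TinyUNet()(x).shape)  # torch.Size([1, 3, 64, 64])
```

The concatenation is what distinguishes U-Net from a plain encoder-decoder: the decoder sees both upsampled context and the encoder's full-resolution features.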
| Model | Parameters | Difference with UNet |
|---|---|---|
| CNN 3x3 | 3,139,587 | |
| UNet 3x3 | 31,037,763 | 27,898,176 |
| CNN 5x5 | 8,713,219 | |
| UNet 5x5 | 81,241,411 | 72,528,192 |
| CNN 7x7 | 17,073,667 | |
| UNet 7x7 | 156,546,883 | 139,473,216 |
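Parameter counts like those in the table can be reproduced by summing tensor sizes over a model's trainable parameters. The model below is a placeholder, not the project's CNN or U-Net:

```python
import torch.nn as nn

# Sketch: counting trainable parameters, as reported in the table above.
def count_params(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

model = nn.Conv2d(3, 16, kernel_size=3)  # 3*16*3*3 weights + 16 biases
print(count_params(model))  # 448
```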
PyTorch DataLoaders – batch training
Epochs – 10
Loss – PyTorch CrossEntropyLoss
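The training setup above can be sketched as a standard PyTorch loop. The random tensors stand in for the CARLA frames and label masks, and the single conv layer is a placeholder for the CNN/U-Net models:

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Fake data standing in for CARLA images and per-pixel labels (3 classes).
images = torch.randn(8, 3, 32, 32)
masks = torch.randint(0, 3, (8, 32, 32))
loader = DataLoader(TensorDataset(images, masks), batch_size=4, shuffle=True)

model = nn.Conv2d(3, 3, 3, padding=1)  # placeholder for CNN / U-Net
criterion = nn.CrossEntropyLoss()      # expects (N, C, H, W) logits, (N, H, W) targets
optimizer = optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)  # per-pixel cross-entropy
        loss.backward()
        optimizer.step()
```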
Metrics: IoU, Dice coefficient
Models compared: CNN-5, UNet-3
CNN-5 vs UNet-3:
UNet-3 performs slightly better than CNN-5, with a Dice score of 0.885 vs 0.85 and an IoU of 0.805 vs 0.755.
However, CNN-5 is roughly 72% smaller than UNet-3 (8.7M vs 31.0M parameters), making it attractive when model size matters.